What are we doing here?

This markdown tutorial expands on Alex Singleton’s Geodemographics lab from his book Urban Analytics. Most of the text is directly his, but I’ve added some comments to further explain his code and some adjustments to the final map output.

How can I follow along?

The easiest way to run through this lab on your own machine is to clone the data from the Urban Analytics Git repository. Strangely, this does not always work if you clone the repository straight to your computer: it occasionally leads to a file corruption issue in which R reports that the .RData workspace image has a “bad magic number.” To avoid this, first download Git for Windows and clone the repository to your machine from there. You may be able to figure this step out on your own, but if you get stuck check out this tutorial for more help.

Learning Objectives

By the end of this practical lab you will be able to:

  * Standardize and normalize an input dataset
  * Use cluster analysis to build a typology
  * Create scores that can be used to describe the clusters
  * Build and visualize a grand index table

Geodemographic methods

There is no single correct method for building a geodemographic classification. What we present here follows the methodology used to create the UK Output Area Classification (OAC), but applied within a single city context; in this case Liverpool. As such, this process mirrors the creation of other regional classifications, such as the London Output Area Classification.

The inputs to OAC are sourced entirely from the UK 2011 Census, and are organised around three domains: demographic, housing and socio-economic. These are then divided into a series of sub-domains comprising a total of 60 variables. The input variables to OAC are all calculated as percentages against an appropriate denominator, with the exception of a standardized illness ratio and population density. Input data were selected on the basis of maintaining similarity to the OAC developed for the 2001 Census, while also exploiting some of the new variables added by the 2011 Census. These requirements were formulated after the outcome of a national consultation exercise delivered by the ONS.

After the 2011 Census data were assembled and the attribute measures calculated, these were first standardized using an inverse hyperbolic sine function, which transforms the attributes closer to a normal distribution. This is argued to assist clustering algorithms such as k-means, given their optimization for finding spherical clusters.

Secondly, prior to clustering, all of the attributes were standardized onto a 0-1 scale using a range standardization method, thus ensuring that each variable had an equal influence on the clustering result. The k-means algorithm was then implemented to cluster the UK into 8 initial clusters referred to as Super Groups. The data were then split by these clusters, and each cluster further divided into between 2 and 4 clusters, forming a second level called Groups and comprising 26 clusters in total. A final set of splits created a Sub Group level, comprising a total of 76 clusters. If you are interested in viewing a map of the 2011 OAC, this can be found here.
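This nested splitting can be sketched with two passes of kmeans(). The following is a minimal illustration on random made-up data, where X stands in for a standardized attribute matrix like the one we build below; the cluster counts (8 Super Groups, 3 Groups each) are purely illustrative, not the published OAC splits, which vary between 2 and 4 per Super Group:

```r
# Hypothetical sketch of nested clustering -- not the published OAC code
set.seed(42)
X <- matrix(runif(200 * 5), ncol = 5) # stand-in for a standardized attribute matrix

# Level 1: Super Groups
super <- kmeans(X, centers = 8, nstart = 20)$cluster

# Level 2: split each Super Group separately into Groups
groups <- rep(NA_integer_, nrow(X))
for (s in unique(super)) {
  rows <- which(super == s)
  groups[rows] <- kmeans(X[rows, ], centers = 3, nstart = 20)$cluster
}

# Combined two-level label, e.g. "4.2"
group_label <- paste(super, groups, sep = ".")
```

Each second-level clustering only sees the rows assigned to one Super Group, which is what makes the Groups a subdivision of the Super Groups rather than an independent partition.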

Data Import and Preparation

First we will load a series of tables which includes: “Liverpool”, “Census_2011_Count_All”, “Variable_Desc” and “OAC_Input_Lookup”. Census_2011_Count_All is a data frame containing census data for the UK; these variables are detailed in “Variable_Desc”. The Liverpool data frame is a list of output areas within Liverpool, UK; if you wanted to repeat this analysis for a different city, you could swap this list. Finally, “OAC_Input_Lookup” is a lookup for the variables used to create OAC.

# Load data
# the `load` function reads pre-formatted R objects into your environment from the file census_2011_UK_OA.RData -- make sure the file path is correct on your machine :)
load("./data/census_2011_UK_OA.RData")

We will now create a subset of the input data for Liverpool:

#Crop
Census_2011_Count <- merge(Liverpool,Census_2011_Count_All,by="OA",all.x=TRUE)

The first stage in the analysis is to create aggregations for the input data. If we look at the top six rows of the “OAC_Input_Lookup” data frame, the constituent census variables for each of the OAC variables are shown in the England_Wales column. Thus for variable k001 (see the VariableCode column) this is “KS102EW0002”; but k002 requires three variables to be aggregated: “KS102EW0003,KS102EW0004,KS102EW0005”.

# the function head shows the first few rows of a table, data frame, or several other similar objects
head(OAC_Input_Lookup)
##   VariableCode  Type Denominator      SubDomain      Domain
## 1         k001 Count KS102EW0001 Population Age Demographic
## 2         k002 Count KS102EW0001 Population Age Demographic
## 3         k003 Count KS102EW0001 Population Age Demographic
## 4         k004 Count KS102EW0001 Population Age Demographic
## 5         k005 Count KS102EW0001 Population Age Demographic
## 6         k006 Count KS102EW0001 Population Age Demographic
##   VariableDescription                       England_Wales
## 1          Age 0 to 4                         KS102EW0002
## 2         Age 5 to 14 KS102EW0003,KS102EW0004,KS102EW0005
## 3        Age 25 to 44             KS102EW0010,KS102EW0011
## 4        Age 45 to 64             KS102EW0012,KS102EW0013
## 5        Age 65 to 89 KS102EW0014,KS102EW0015,KS102EW0016
## 6     Age 90 and over                         KS102EW0017

We will now write some code that will calculate all of the numerators:

  1. For each row in the OAC_Input_Lookup data frame (i.e. each OAC input variable)
  2. Create a list of the census variables needed to create the OAC input variable (i.e. the England_Wales column)
  3. Extract data for the identified variables from the census data contained in the Census_2011_Count data frame
  4. Sum the variables and give the result the OAC input variable name
# read the column Census_2011_Count$OA into a new one-column data frame
OAC_Input <- as.data.frame(Census_2011_Count$OA)
# rename the column of the data frame to just OA
colnames(OAC_Input) <- "OA"

# Loop through each row in the OAC input table
for (n in 1:nrow(OAC_Input_Lookup)){

      # Get the variables to aggregate for the row specified by n
      select_vars <- OAC_Input_Lookup[n,"England_Wales"]
      
      # Create a list of the variables to select
      select_vars <- unlist(strsplit(paste(select_vars),","))
      
      # Create variable name
      vname <- OAC_Input_Lookup[n,"VariableCode"] 
      
      # Creates a sum of the census variables for each Output Area
      tmp <- data.frame(rowSums(Census_2011_Count[,select_vars, drop=FALSE]))
      colnames(tmp) <- vname
      
      # Append new variable to the OAC_Input object
      OAC_Input <- cbind(OAC_Input,tmp)
      
      # Remove temporary objects
      remove(list = c("vname","tmp"))

} # END: Loop through each row in the OAC input table

Although we have included the variables for k035, this is merely for coding simplicity above; we will calculate the Standardized Illness Ratio later, so we now remove this variable from the numerator data frame:

#Remove attributes for SIR
OAC_Input$k035 <- NULL

We will now create another data frame containing the denominators

OAC_Input_den <- as.data.frame(Census_2011_Count$OA)
colnames(OAC_Input_den) <- "OA"

# Create a list of unique denominators
# filter out duplicate denominators
den_list <- unique(OAC_Input_Lookup[,"Denominator"])
# filter out empty denominators
den_list <- paste(den_list[den_list != ""])

# Select denominators
OAC_Input_den <- Census_2011_Count[,c("OA",den_list)]

And then merge this with the numerators:

#Merge
OAC_Input <- merge(OAC_Input,OAC_Input_den, by="OA")

Now that we have assembled a data frame of numerators and denominators we can calculate the percentages. To ensure that we compare the correct numerator and denominator variables, we will again use a loop that runs through a list of variable names created as follows:

# Get numerator denominator list where the Type is "Count" - i.e. not ratio
K_Var <- OAC_Input_Lookup[OAC_Input_Lookup$Type == "Count",c(1,3)]
# View top 6 rows
head(K_Var)
##   VariableCode Denominator
## 1         k001 KS102EW0001
## 2         k002 KS102EW0001
## 3         k003 KS102EW0001
## 4         k004 KS102EW0001
## 5         k005 KS102EW0001
## 6         k006 KS102EW0001
# Create an OA list / data frame
OAC_Input_PCT_RATIO <- subset(OAC_Input, select = "OA")

# Loop
for (n in 1:nrow(K_Var)){
  
  num <- paste(K_Var[n,"VariableCode"]) # Get numerator name
  den <- paste(K_Var[n,"Denominator"]) # Get denominator name
  tmp <- data.frame(OAC_Input[,num] / OAC_Input[,den] * 100) # Calculate percentages
  colnames(tmp) <- num
  OAC_Input_PCT_RATIO <- cbind(OAC_Input_PCT_RATIO,tmp) # Append the percentages
  
  # Remove temporary objects
  remove(list = c("tmp","num","den"))
}

Of the final two variables, the first is population density (k007), which we can join from the original census table:

#Extract Variable
tmp <- Census_2011_Count[,c("OA","KS101EW0008")]
colnames(tmp) <- c("OA","k007")

#Merge
OAC_Input_PCT_RATIO <- merge(OAC_Input_PCT_RATIO,tmp,by="OA")

We will now calculate the variable k035, the standardized illness ratio (SIR), which needs to be calculated for each subset of the national data (in this case Liverpool):

# Calculate counts of ill people aged 15 or less and greater than or equal to 65
ill_16_64 <- rowSums(Census_2011_Count[,c("KS301EW0005","KS301EW0006")]) # Ill people 16-64
ill_total <-   rowSums(Census_2011_Count[,c("KS301EW0002","KS301EW0003")]) # All ill people
ill_L15_G65 <- ill_total - ill_16_64 # Ill people 15 or less and greater than or equal to 65

# Calculate total people 15 or less and greater than or equal to 65
t_pop_16_64 <- rowSums(Census_2011_Count[,c("KS102EW0007","KS102EW0008","KS102EW0009","KS102EW0010","KS102EW0011","KS102EW0012","KS102EW0013")]) # People 16-64
t_pop <- Census_2011_Count$KS101EW0001 # All people
t_pop_L15_G65 <- t_pop - t_pop_16_64 # All people 15 or less and greater than or equal to 65

# Calculate expected rate
ex_ill_16_64 <- t_pop_16_64 * (sum(ill_16_64)/sum(t_pop_16_64)) # Expected ill 16-64
ex_ill_L15_G65 <- t_pop_L15_G65 * (sum(ill_L15_G65)/sum(t_pop_L15_G65)) # Expected ill people 15 or less and greater than or equal to 65

ex_ill <- ex_ill_16_64 + ex_ill_L15_G65 # total expected ill people

# Ratio
SIR <- as.data.frame(ill_total / ex_ill * 100) # ratio between ill people and expected ill people
colnames(SIR) <- "k035"

# Merge data
OAC_Input_PCT_RATIO <- cbind(OAC_Input_PCT_RATIO,SIR)

# Remove unwanted objects
remove(list=c("SIR","ill_16_64","ill_total","ill_L15_G65","t_pop_16_64","t_pop","t_pop_L15_G65","ex_ill_16_64","ex_ill_L15_G65","ex_ill"))

We will now apply the two standardization and normalization procedures to the input data (OAC_Input_PCT_RATIO) - these are inverse hyperbolic sine and then range standardization.

Inverse hyperbolic sine transformations can be used to rein in outliers to a certain extent. They are similar to log transformations, but are defined at 0. For more information, check out this rant from economist Frances Woolley.
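As a quick sanity check (not part of the original lab), the manual formula used in the next chunk is exactly base R's built-in asinh():

```r
# The IHS transform log(x + sqrt(x^2 + 1)) is base R's asinh(),
# and unlike log() it is defined at x = 0
x <- c(0, 1, 50, 1000)
ihs <- log(x + sqrt(x^2 + 1))
all.equal(ihs, asinh(x))  # TRUE
```

Note that asinh(0) is 0, whereas log(0) is -Inf, which is why the IHS works on percentage variables that contain zeros.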

# Calculate inverse hyperbolic sine
OAC_Input_PCT_RATIO_IHS <- log(OAC_Input_PCT_RATIO[,2:61]+sqrt(OAC_Input_PCT_RATIO[,2:61]^2+1))

# Calculate Range
range_01 <- function(x){(x-min(x))/(max(x)-min(x))} # range function
OAC_Input_PCT_RATIO_IHS_01 <- apply(OAC_Input_PCT_RATIO_IHS, 2, range_01) # apply range function to columns. This changes all column values to be between 0 and 1

# Add the OA codes back onto the data frame as row names
rownames(OAC_Input_PCT_RATIO_IHS_01) <- OAC_Input_PCT_RATIO$OA

Estimating the number of clusters

You have now created a subset of 1584 Output Areas for the extent of Liverpool with inputs that mirror the attributes, measures, transformation and standardization methods used for the UK OAC 2011 classification. Prior to clustering this bespoke Liverpool classification, it is worth considering what would be an appropriate number of clusters for the initial Super Group (most aggregate) level.

This can be considered by running some test cluster analyses with different cluster frequencies (k) and, for each result, examining a statistic called the total within sum of squares. This is a measure of how well the cluster frequency fits the data - essentially the mean standardized distance of the data observations to their cluster mean. These tests are typically plotted with the purpose of identifying an “elbow criterion”, a visual indication of where an appropriate cluster frequency might be set. The trade-off you are looking for is the impact of increasing cluster frequency (i.e. making a more complex classification) versus the reduction in this score.

#loading library used to  make the plots
library(ggplot2)

# Create a new empty numeric object to store the wss results
wss <- numeric()

# Run k means for 2-12 clusters and store the wss results
for (i in 2:12) wss[i] <- sum(kmeans(OAC_Input_PCT_RATIO_IHS_01, centers=i,nstart=20)$withinss)

# Create a data frame with the results, adding a further column for the cluster number
wss <- data.frame(2:12,wss[-1])

# Plot the results
names(wss) <- c("k","Twss")
ggplot(data=wss, aes(x= k, y=Twss)) + geom_path() + geom_point() + scale_x_continuous(breaks=2:12) + labs(y = "Total within sum of squares")

You will see that there are no large decreases in the within sum of squares, and a minor moderation of the decrease around 7 or 8 clusters, which also mirrors similar patterns observed within the UK OAC. As such, a 7-cluster solution was chosen.

Building the geodemographic

Now that we have chosen seven clusters, we will consider how the partitioning can be optimized. Clustering uses the kmeans() function, which accepts: a dataset input - in this case OAC_Input_PCT_RATIO_IHS_01; centers, the number of clusters; iter.max, which should simply be set to a large number to allow the algorithm to complete; and nstart, the number of times the cluster analysis is run. It is common to set nstart to around 10,000 when building geodemographics, although this will take quite a while to complete. The repetition is necessary because k-means is stochastic: each run starts from random centers and may converge to a different solution.

There is no need to run the following; and instead we will load a pre-run object for these settings:

# this code generates the object described above and loaded in the next line; it is
# commented out because it can take a very long time -- if you want to run it yourself,
# strongly consider lowering nstart to cut down run time
# cluster_7 <- kmeans(x=OAC_Input_PCT_RATIO_IHS_01, centers=7, iter.max=1000000, nstart=10000)
# Load cluster object
load("./data/cluster_7.Rdata")

The cluster_7 object contains a list of different outputs related to the cluster analysis- we can view these:

# Show object content
str(cluster_7)
## List of 9
##  $ cluster     : Named int [1:1584] 7 5 7 5 5 7 5 1 1 4 ...
##   ..- attr(*, "names")= chr [1:1584] "E00032987" "E00032988" "E00032989" "E00032990" ...
##  $ centers     : num [1:7, 1:60] 0.553 0.584 0.677 0.666 0.391 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:60] "k001" "k002" "k003" "k004" ...
##  $ totss       : num 2827
##  $ withinss    : num [1:7] 286 308 250 255 159 ...
##  $ tot.withinss: num 1635
##  $ betweenss   : num 1192
##  $ size        : int [1:7] 259 340 279 334 109 73 190
##  $ iter        : int 6
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The cluster results can therefore be accessed as follows:

# Lookup Table
lookup <- data.frame(cluster_7$cluster)
# Add OA codes
lookup$OA <- rownames(lookup)
colnames(lookup) <- c("K_7","OA")
# Recode clusters as letter
lookup$SUPER <- LETTERS[lookup$K_7]

We will also look at the distribution of these clusters:

table(lookup$K_7)
## 
##   1   2   3   4   5   6   7 
## 259 340 279 334 109  73 190

And then map them:

# Load packages
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/isaac/Documents/R/win-library/3.5/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/isaac/Documents/R/win-library/3.5/rgdal/proj
##  Linking to sp version: 1.3-2
library(tmap)

# Import OA boundaries
# Note: Depending on your R version, you may need to add a second parameter to readOGR.
# If you get an error without it, add "OGRGeoJSON" as the second parameter (include the quotes).
liverpool_SP <- readOGR("./data/Liverpool_OA_2011.geojson")
## OGR data source with driver: GeoJSON 
## Source: "C:\Users\isaac\Documents\GitHub\urban_analytics\10_Data_Reduction_Geodemographics\data\Liverpool_OA_2011.geojson", layer: "Liverpool_OA_2011"
## with 1584 features
## It has 1 fields
# Merge lookup
# this merges data with information on OA Shapes stored in the GeoJSON
liverpool_SP <- merge(liverpool_SP, lookup, by.x="oa_code",by.y="OA")

#preferences for the map
#from tm_polygons - sets many properties controlling how the data are displayed
  #col = "SUPER" sets that the data being mapped are the Super Groups (a column of the data)
  #border.col sets the color of the borders between polygons
  #palette sets the colors of the polygons (the colors representing each group)
  #border.alpha sets the transparency of the borders
  #title sets the title of the map shown on top of the legend
  #showNA=FALSE means that no areas with NA values are shown, and NA does not appear on the legend
#from tm_scale_bar - adds a scale bar, which auto-updates as you zoom in and out
  #position sets where the scale bar appears
m <- tm_shape(liverpool_SP, projection=27700) +
    tm_polygons(
      col="SUPER", 
      border.col = "grey50",   
      palette="Set1", 
      border.alpha = .3, 
      title="Cluster aka Super Group", 
      showNA=FALSE) +
  tm_scale_bar(position = c("left", "bottom"))



#Create leaflet plot
tmap_leaflet(m)
## Warning: The shape liverpool_SP is invalid. See sf::st_is_valid
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

So wait, what were those clusters, AKA Super Groups?

(It’s Isaac again, cutting back in)

Essentially, they were 7 arbitrary groups whose members have similar values to each other across the demographic, housing, and socio-economic variables considered here. These groups were generated by the kmeans algorithm, weighting each of these variables equally.

The OAs which are in the same cluster do often seem to be next to other OAs in their cluster in space. This is probably a good sign for the accuracy of the clustering, as areas which are near to each other tend to be similar. It is also important to note, though, that every cluster contains OAs far from other OAs in the same cluster. This means that there are OAs which have similar traits but are not near to each other, which was to be expected.

(okay now back to Singleton)

Creating cluster descriptions and profiles

We can now contextualize the cluster assignments further by looking at the rates for input attributes within each cluster compared to the Liverpool average. This is a common technique by which labels and written descriptions can be ascribed to the clusters.

We will first create a table of index scores. These are calculated by dividing the rate for a particular variable within a cluster by the rate for the same variable across the total population, then multiplying by 100, so the average is set at 100. Thus, half the average would be 50, and double would be 200.
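As a toy illustration with made-up numbers (not from the Liverpool data): if 12% of a cluster's residents fall in some age band, against 6% across the whole study area, that cluster's index score for the variable is:

```r
# toy numbers, not from the Liverpool data
cluster_rate <- 12 / 100   # rate within the cluster
overall_rate <- 6 / 100    # rate across the whole study area
round(cluster_rate / overall_rate * 100)  # 200
```

The loop below does exactly this for every variable and cluster, except that it computes the rates from the summed numerators and denominators.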

# Merge Original Data (inc. denominators)
LiVOAC_Lookup_Input <- merge(lookup,OAC_Input,by="OA",all.x=TRUE)

# Remove Ratio Variables
LiVOAC_Lookup_Input$k007 <- NULL
LiVOAC_Lookup_Input$k035 <- NULL

# Create Aggregations by SuperGroup (cluster)
SuperGroup <-aggregate(LiVOAC_Lookup_Input[,4:78], by=list(LiVOAC_Lookup_Input$SUPER),  FUN=sum)

# Create a data frame that will be used to append the index scores
G_Index <- data.frame(SUPER=LETTERS[1:7])

# Loop
for (n in 1:nrow(K_Var)){
  
  num <- paste(K_Var[n,"VariableCode"]) # Get numerator name
  den <- paste(K_Var[n,"Denominator"]) # Get denominator name
  tmp <- data.frame(round((SuperGroup[,num] / SuperGroup[,den]) / (sum(SuperGroup[,num])/sum(SuperGroup[,den]))*100)) # Calculate index score - these are also rounded
  colnames(tmp) <- num
  
  G_Index <- cbind(G_Index,tmp) # Append the index calculations
  
  # Remove temporary objects
  remove(list = c("tmp","num","den"))
}

# View the index scores
G_Index
##   SUPER k001 k002 k003 k004 k005 k006 k008 k009 k010 k011 k012 k013 k014
## 1     A   83   91   91  114  147  237   90   92   84  144  106   81   38
## 2     B   90  109   84  129  124  121   23   65  164   71  106   60   97
## 3     C  125  104  115   98   87   66    8   98  105  102  106   78   71
## 4     D  121  129   92  104  108   75   31   95   98  117  107   63   59
## 5     E   45   21  184   59   33   41   98  152   42   80   89  171  210
## 6     F   35   31   62   32   30   37  933  171   29   33   84  137  280
## 7     G  129  113  120   83   73   50   20  112   76  125   77  238  139
##   k015 k016 k017 k018 k019 k020 k021 k022 k023 k024 k025 k026 k027 k028
## 1   40   41   43   59   46  105   60   73   51   73   95    6   51   89
## 2   92   48   73   25   32  105   68   29   44  142  130   11  317  221
## 3   56   44   47   48   38  104   81   75   55  115  100   58   30   45
## 4   23   29   39   49   22  106   41   76   57   79  140    4   66  119
## 5  197   79  171  146  275   86  432  186  136  144   14  283   14    9
## 6  348  393  415  131  143   86  199  116  148   71   42 1040   31   26
## 7  195  305  186  408  408   84  134  291  357   68   70  128   57   59
##   k029 k030 k031 k032 k033 k034 k036 k037 k038 k039 k040 k041 k042 k043
## 1   82  155   70  183   61   92  109  100   62   64   44   57   97   83
## 2   24   30  181   11   43   23  123  110   84  143   54  236   80  157
## 3  185   43  127   31  128   59   98  113   94  105   64   93  128  115
## 4  147   11   87  164   48   69  113  117   68   45   54   65  104   87
## 5    8  375   39   67  263  302   53   57  112  227  148   61  105   87
## 6  101  199   54   71  231  261   42   47  322   97  480   77   72   42
## 7  109  143   48  152  140  159   82   93   84   88  101   39  111   65
##   k044 k045 k046 k047 k048 k049 k050 k051 k052 k053 k054 k055 k056 k057
## 1   83  128   98  101   56  114  112  107  103  126   90   76   93  120
## 2   61   53   91  104   73  115  106  104   87   94   58  119  121   70
## 3   98   97   90  104   98  100  100  104   98  106   80   97  107   92
## 4   89  132  111   95  110  120  121  123  113  126   93   54   82  136
## 5  260   79   64  117   98   48   67   61   78   58  132  209  117   78
## 6  114   38  178   64  295   37   43   25  143   44  258  102   60   76
## 7  117  158  112   95  108   81   90  102  105   89  165   82   80  137
##   k058 k059 k060
## 1   99   84  102
## 2  125  124   96
## 3  114  104  104
## 4   87   68  110
## 5   82  121   90
## 6   54  105   75
## 7   71   86  103

To assist with spotting trends within the grand index table we can create a plot of shaded cells. Before doing this we will, however, need to convert the data into a narrow format. The simplest way of doing this is with the melt function contained within the reshape2 package.

install.packages("reshape2") # only needed the first time; skip if already installed
library(reshape2)
# Convert from wide to narrow format
G_Index_Melt <- melt(G_Index, id.vars="SUPER")
# View the top of the new narrow formatted data frame
head(G_Index_Melt)
##   SUPER variable value
## 1     A     k001    83
## 2     B     k001    90
## 3     C     k001   125
## 4     D     k001   121
## 5     E     k001    45
## 6     F     k001    35

We then need to create a new variable that we will use on the plot to create color bins. It is common practice when building a geodemographic to create aggregations of less than or equal to 80, between 81 and 120, and greater than 120. We implement the recoding here with an ifelse() function, and then re-order the created factors so that they display from lowest to highest on the plot. Finally, we will import and merge a final column which is a set of shortened variable descriptions; if you want the full versions (which are too long for the plot), remember these can be found in the “OAC_Input_Lookup” data frame.

# Recode the index scores into aggregate groupings
G_Index_Melt$band <- ifelse(G_Index_Melt$value <= 80,"< 80",ifelse(G_Index_Melt$value > 80 & G_Index_Melt$value <= 120,"80-120",">120"))

# Add a column with short descriptions of the variables
short <- read.csv("./data/OAC_Input_Lookup_short_labels.csv")
G_Index_Melt <- merge(G_Index_Melt,short,by.x="variable",by.y="VariableCode",all.x=TRUE)

# Order the created factors appropriately - needed to ensure the legend and axis make sense in ggplot2
G_Index_Melt$band <- factor(G_Index_Melt$band, levels = c("< 80","80-120",">120"))
G_Index_Melt$VariableDescription <- factor(G_Index_Melt$VariableDescription, levels = short$VariableDescription)

Using ggplot2 we can now create a shaded table which you can use to come up with descriptions of the clusters and creative labels.

library(ggplot2)
p <- ggplot(G_Index_Melt, aes(x=SUPER, y=VariableDescription, label=value, fill=band)) + 
  scale_fill_manual(name = "Band",values = c("#EB753B","#F7D865","#B3D09F")) +
  scale_x_discrete(position = "top") +
  geom_tile(alpha=0.8) +
  geom_text(colour="black")
p

Further resources / training